Free curand states before the thread is destroyed #1912
Open
no1d wants to merge 1 commit into OpenNMT:master
Conversation
I've had this issue too, running CUDA 12.4 on 4080s. 0xC0000409 (3221226505) is Windows' stack buffer overrun. I've literally tried to resolve this on my own for the last year.
Contributor
Hi, @Purfview,
morganjeremiah7 pushed a commit to morganjeremiah7/hush-profanity that referenced this pull request on Apr 28, 2026
…rash
Two changes that go together:
1) Stack upgrade — removes the cuDNN 8 / cuDNN 9 dual-load
- torch 2.5.1+cu121 -> 2.8.0+cu126
- ctranslate2 4.4.0 -> 4.7.1 (uses cuDNN 9 natively)
- whisperx 3.4.5 -> 3.8.5
- nvidia-cudnn-cu12==8.9.7.29 -> removed (torch's bundled cuDNN 9 is
now the only one in the process)
- install-windows.ps1, pyproject.toml, requirements.txt updated.
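For concreteness, the stack-upgrade pins listed above might read as follows in requirements.txt. This is a sketch assuming plain pip-style pins; the real file may also need an extra index URL for the `+cu126` CUDA builds:

```
torch==2.8.0+cu126
ctranslate2==4.7.1
whisperx==3.8.5
```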
This alone did not fix the crash: even with the cleaner stack, python
still died on the 2nd file with the same KERNELBASE 0xe06d7363 +
ucrtbase 0xC0000409 signature.
2) Subprocess-per-file transcription — the bulletproof workaround for
OpenNMT/CTranslate2#1912 / faster-whisper#71/#1293. ctranslate2's CUDA
cleanup path corrupts the heap when WhisperModel is destroyed; the
corruption gets touched fatally after 1-3 destruct/reconstruct cycles
in one process. The fix recommended by the upstream issue threads is
to run each transcription in its own process and let OS-level CUDA
context teardown bypass the buggy cleanup path.
New module src/hush_profanity/_transcribe_worker.py:
- JSON-in, JSON-out contract (config in, words out, both via temp files)
- exit codes: 0 success, 1 config/IO error, 2 transcribe error, >2 unknown
- stderr captured by parent and forwarded to the main log
scanner.gpu_worker now spawns this worker per file via subprocess.run
with a 30 min timeout. If the subprocess crashes (which it shouldn't,
but if ctranslate2 gets weird) the parent catches RuntimeError and
marks just that file as failed, then continues with the next.
Verified: 3 sequential subprocess transcriptions on CUDA all exit clean.
The in-process version of the same test crashed on the 3rd run.
Cost: ~5-10 s subprocess startup per file. Negligible vs the alternative
(crash after 1-2 files).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
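The subprocess-per-file pattern described in the commit message can be sketched with nothing but the standard library. The names here (`WORKER_SRC`, `transcribe_in_subprocess`) are hypothetical stand-ins: the real worker lives in src/hush_profanity/_transcribe_worker.py and runs WhisperModel, while this sketch fakes the result so the shape of the contract (config in and words out via temp files, exit codes 0/1/2) is visible:

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical worker body. The real one would load WhisperModel and
# transcribe; here we only demonstrate the JSON-in, JSON-out contract.
WORKER_SRC = r"""
import json, sys
cfg_path, out_path = sys.argv[1], sys.argv[2]
try:
    cfg = json.loads(open(cfg_path).read())
except Exception:
    sys.exit(1)  # config/IO error
try:
    # Real worker: run the model on cfg["audio"]. Sketch: fake one word.
    words = [{"word": "hello", "start": 0.0, "end": 0.4}]
    open(out_path, "w").write(json.dumps(words))
except Exception:
    sys.exit(2)  # transcribe error
sys.exit(0)      # success
"""

def transcribe_in_subprocess(audio_path: str, timeout: float = 1800.0):
    """Run one transcription in its own process so OS-level CUDA context
    teardown, not the buggy in-process cleanup path, reclaims the GPU."""
    with tempfile.TemporaryDirectory() as tmp:
        cfg = Path(tmp) / "cfg.json"
        out = Path(tmp) / "words.json"
        cfg.write_text(json.dumps({"audio": audio_path}))
        proc = subprocess.run(
            [sys.executable, "-c", WORKER_SRC, str(cfg), str(out)],
            capture_output=True, text=True, timeout=timeout,
        )
        if proc.returncode != 0:
            # Parent survives a worker crash: mark just this file failed.
            raise RuntimeError(f"worker exited {proc.returncode}: {proc.stderr}")
        return json.loads(out.read_text())

words = transcribe_in_subprocess("clip.wav")
```

Because the worker is a fresh interpreter per file, any heap corruption from the CUDA cleanup dies with the process instead of accumulating across destruct/reconstruct cycles.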
morganjeremiah7 pushed a commit to morganjeremiah7/hush-profanity that referenced this pull request on Apr 28, 2026
…-whisper
Why: ctranslate2 has a long-standing CUDA cleanup crash on Windows
(OpenNMT/CTranslate2#1912, faster-whisper#71/#1293) that we hit reliably
across every version (4.4.0 → 4.7.1) and every workaround we tried:
- int8 quantization (Test 1) — VRAM dropped 22GB → 11GB but crashes
persisted, ruling out memory exhaustion as the cause
- alignment off (Test 3) — removed PyTorch from the GPU entirely so only
ctranslate2 was a CUDA library; still crashed, ruling out the
dual-allocator theory
- stack rollback to ct2 4.4.0 / cu121 / cuDNN 8 (Test 2) — exactly the
version that did 49 files in a row originally; still crashed, so the bug
is in ctranslate2 itself regardless of version
- subprocess isolation — kept the parent alive when workers crashed but
still lost ~30% of files per scan
The cure was replacing the engine. openai-whisper is the reference
PyTorch implementation. Slower (~3-4× per file) but rock-solid: same
PyTorch CUDA stack as the wav2vec2 alignment in WhisperX, so only one
CUDA allocator in the process.
Verified with a sequential subprocess test (alignment ON, real CUDA) —
3/3 clean exits where the in-process ctranslate2 version crashed every
time on the 3rd run. Then 7/7 successful overnight scan on the 8 files
that had previously failed.
Other changes:
- transcribe.py: full rewrite around the openai-whisper API. Same Word
dataclass output. Subprocess pattern preserved for belt-and-suspenders.
- verbose=None instead of False to suppress the tqdm progress bar that
was polluting the worker stderr → main log.
- install-windows.ps1: drops nvidia-cudnn-cu12==8.9.7.29 (no longer
needed — torch's bundled cuDNN is sufficient). Adds triton-windows so
openai-whisper's word-timestamp DTW kernels run on GPU instead of
falling back to a much slower pure-PyTorch path.
- pyproject.toml + requirements.txt: pin openai-whisper, whisperx<3.5
(3.5+ pulls ctranslate2 back in transitively), torch 2.5.1+cu121,
triton-windows 3.1.0.post17 (Windows-only).
- settings.example.toml: clarify that compute_type, vad_filter, and
whisper_batch_size are now ignored / mapped because openai-whisper has
no equivalent to faster-whisper's batched pipeline.
- .gitignore: add .claude/ for agent-tool local state.
Speed cost on a 3090: each file takes ~6-8 min instead of ~3 min, so an
83-file overnight scan goes from ~4-5 hr to ~6-10 hr. Acceptable for a
stack that doesn't crash.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
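Keeping "the same Word dataclass output" after the engine swap amounts to a small adapter over the dict that openai-whisper's `transcribe(..., word_timestamps=True)` returns, which carries a `"words"` list on each segment. A minimal sketch, with the `Word` fields and the `words_from_result` helper assumed rather than taken from the repo:

```python
from dataclasses import dataclass

@dataclass
class Word:
    word: str
    start: float
    end: float

def words_from_result(result: dict) -> list[Word]:
    """Flatten openai-whisper's segment/word dict into Word records."""
    return [
        Word(w["word"].strip(), w["start"], w["end"])
        for seg in result["segments"]
        for w in seg.get("words", [])
    ]

# Shape of the result dict when word_timestamps=True (values are fake):
fake = {"segments": [{"words": [
    {"word": " hello", "start": 0.0, "end": 0.4},
    {"word": " world", "start": 0.4, "end": 0.9},
]}]}
words = words_from_result(fake)
```

Because the adapter only touches the result dict, the rest of the pipeline (scanner, worker contract, logging) stays unchanged regardless of which engine produced it.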
Tried #1201 with no luck, so this should fix SYSTRAN/faster-whisper#71